Chapter 11 XML

This chapter shows you how to process the recently released BNC2014, which is by far the largest representative collection of spoken English collected in the UK. For the purpose of our in-class tutorials, I have included a small sample of the BNC2014 in our demo_data. The whole dataset is available via the official website: British National Corpus 2014. Please sign up for complete access if you need the full corpus for your own research.

11.1 BNC Spoken 2014

XML is similar to HTML. Before you process the data, you need to understand the structure of the XML tags in the files. Other than that, the steps are much the same as what we have done before.

First, we read the XML using read_html():
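A minimal sketch of this step, assuming one sample file from the demo_data directory (the file name SXYZ.xml is a hypothetical placeholder; substitute any XML file from your own copy):

```r
library(rvest)

## Parse one BNC2014 XML transcript (hypothetical file name)
bnc_xml <- read_html("demo_data/corp-bnc-spoken2014-sample/SXYZ.xml")
```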

Now it is intuitive that our next step is to extract all utterances (tagged `<u>...</u>`) in the XML file. So you may want to do the following:
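A sketch of this first attempt, assuming the parsed document is stored in an object named `bnc_xml` (a hypothetical name for the result of `read_html()`):

```r
library(rvest)

## Grab every <u> node and flatten it to text -- this is what
## produces the problematic output below
bnc_xml %>%
  html_nodes("u") %>%
  html_text() %>%
  head()
```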

## [1] "\r\nanhourlaterhopeshestaysdownratherlate"                    
## [2] "\r\nwellshehadthosetwohoursearlier"                           
## [3] "\r\nyeahIknowbutthat'swhywe'reanhourlateisn'tit?mmI'mtirednow"
## [4] "\r\n"                                                         
## [5] "\r\ndidyoutext--ANONnameM"                                    
## [6] "\r\nyeahyeahhewrotebacknobotherlad"

See the problem?

Using the above method, you lose the word boundary information from the corpus.

What if you do the following?
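One way to sketch this second attempt, again assuming the hypothetical `bnc_xml` object holds the parsed file:

```r
library(rvest)

## Extract the <w> (word) nodes directly, so each word comes out
## as its own string
bnc_xml %>%
  html_nodes("u w") %>%
  html_text() %>%
  head(20)
```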

##  [1] "an"      "hour"    "later"   "hope"    "she"     "stays"   "down"   
##  [8] "rather"  "late"    "well"    "she"     "had"     "those"   "two"    
## [15] "hours"   "earlier" "yeah"    "I"       "know"    "but"

At first sight, it may seem that we have solved the problem, but we haven’t. In fact, this approach creates even more problems:

  • Our second method does not extract non-word tokens within each utterance (e.g., <pause .../>, <vocal .../>)
  • Our second method loses the utterance information (i.e., we don’t know which utterance each word belongs to)

Exercise 11.1 Please come up with a way to extract both words and non-word tokens from each utterance. Ideally, the resulting data frame would consist of rows corresponding to the utterances and columns recording the attributes of each utterance.

Most importantly, the data frame should record not only the tokens of each utterance but also the token-level attributes of each word/non-word token, e.g., the part-of-speech tags, the duration of pauses, etc.
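As a hint, one possible starting point (not a complete solution) is to walk over the children of each `<u>` node and record their tag names and attributes. The object name `bnc_xml` and the attribute names `who`, `pos`, and `dur` below are assumptions about the parsed file and the BNC2014 tag set, so check them against your own files:

```r
library(rvest)
library(dplyr)
library(purrr)

utterances <- html_nodes(bnc_xml, "u")  # bnc_xml: hypothetical parsed document

tokens_df <- map_df(seq_along(utterances), function(i) {
  children <- html_children(utterances[[i]])
  tibble(
    utterance_id = i,
    who  = html_attr(utterances[[i]], "who"),  # speaker ID (assumed attribute)
    tag  = html_name(children),                # "w", "pause", "vocal", ...
    text = html_text(children),
    pos  = html_attr(children, "pos"),         # POS tag on <w> (assumed)
    dur  = html_attr(children, "dur")          # pause duration (assumed)
  )
})
```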

11.2 Process the Whole Directory of BNC2014 Sample

11.2.1 Define Function

In Section 11.1, if you have figured out how to extract utterances as well as token-based information from the XML file, you can easily wrap the whole procedure in one function. With this function, we can apply the same procedure to all the XML files of the BNC2014.

For example, let’s assume that we have defined a function:

read_xml_BNC2014 <- function(xml){
  ...
}

This function takes one XML file as an argument and returns a data frame consisting of utterances and other relevant information from the XML.

Exercise 11.2 Now your job is to write this function, read_xml_BNC2014().

11.2.2 Process All Files in the Directory

Now we utilize the self-defined function, read_xml_BNC2014(), to process all XML files in demo_data/corp-bnc-spoken2014-sample/. We then combine the data frames returned from each XML file into one bigger data frame, corp_bnc_df:
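One possible sketch of this step, assuming you have written `read_xml_BNC2014()` in Exercise 11.2:

```r
library(purrr)

## Collect the full paths of all XML files in the sample directory
xml_files <- list.files("demo_data/corp-bnc-spoken2014-sample/",
                        pattern = "\\.xml$", full.names = TRUE)

start_time <- Sys.time()
corp_bnc_df <- map_df(xml_files, read_xml_BNC2014)  # row-bind one data frame per file
Sys.time() - start_time
```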

## Time difference of 2.161701 mins

It takes about two minutes to process the sample directory. You may want to store the corp_bnc_df data frame for later use so that you don’t have to re-process the XML files every time you work on the BNC2014.
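For example, saving the data frame as an .RDS file keeps it (including its column types) intact across sessions; the file name here is just an example:

```r
## Save the processed corpus for later sessions
saveRDS(corp_bnc_df, "corp_bnc_df.RDS")

## In a later session, load it back without re-parsing the XML
corp_bnc_df <- readRDS("corp_bnc_df.RDS")
```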

11.3 Metadata

The best thing about the BNC2014 is its rich demographic information about the settings and speakers of the recorded conversations. The whole corpus comes with two metadata sets:

  • bnc2014spoken-textdata.tsv: metadata for each text transcript
  • bnc2014spoken-speakerdata.tsv: metadata for each speaker ID

These two metadata sets allow us to get more information about each transcript as well as the speakers in those transcripts.
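A sketch of loading the two metadata sets with readr, assuming they sit in demo_data (adjust the paths to your own copy; depending on the distribution, the TSV files may come without a header row, in which case you would need `col_names = FALSE` plus your own column names):

```r
library(readr)

## Read the text-level and speaker-level metadata tables
meta_text    <- read_tsv("demo_data/bnc2014spoken-textdata.tsv")
meta_speaker <- read_tsv("demo_data/bnc2014spoken-speakerdata.tsv")
```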

11.3.1 Text Metadata

11.3.2 Speaker Metadata

11.4 BNC2014 for Sociolinguistic Variation

The BNC2014 was born for the study of sociolinguistic variation. Here we show you some naive examples, but you should get the idea.

11.4.1 Word Frequency vs. Gender

Now we are ready to explore the gender differences in language.

11.4.1.1 Preprocessing

To begin with, there are some utterances with no words at all. We would probably like to remove these utterances.
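A sketch of this filtering step, under the assumption that corp_bnc_df has one row per token with hypothetical columns `utterance_id` and `tag`; we keep only utterances containing at least one `<w>` token:

```r
library(dplyr)

corp_bnc_df <- corp_bnc_df %>%
  group_by(utterance_id) %>%     # hypothetical column name
  filter(any(tag == "w")) %>%    # keep utterances with >= 1 word token
  ungroup()
```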

11.4.1.2 Target Structures

Let’s assume that we like to know which verbs are most frequently used by men and women.

  • Female wordcloud
  • Male wordcloud
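A rough sketch of how such wordclouds could be produced; the column names (`pos`, `gender`, `text`), the POS prefix "V" for verbs, and the gender labels are all assumptions about your own data frame and the BNC2014 tag set:

```r
library(dplyr)
library(stringr)
library(wordcloud)

## Count verb tokens by speaker gender (tag prefix "V" is assumed)
verb_freq <- corp_bnc_df %>%
  filter(str_detect(pos, "^V")) %>%
  count(gender, text, sort = TRUE)

## Female speakers' verb wordcloud (gender label is an assumption)
fem <- filter(verb_freq, gender == "Female")
wordcloud(fem$text, fem$n, max.words = 100)
```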

11.4.2 Degree ADV + ADJ

11.4.3 Trigrams

Exercise 11.3 Remove stopwords from the frequency list of words (unigrams).
Exercise 11.4 Include dispersion metrics in the n-gram frequency list.